Abstract

Complex mechanisms, such as cell-signaling pathways, consist of many
highly interconnected components, yet they are often described in
disconnected fragmentary ways. The goal of DRUM (Deep Reader for
Understanding Mechanisms) is to develop a system that can read papers and
combine results of individual studies into a comprehensive explanatory
model. A first step is to automatically extract relevant events and event
relationships from the literature. This paper describes initial steps in
extending an existing general deep language understanding system, TRIPS,
to read biomedical papers. In a preliminary evaluation, our system was the
best performing system among the participants, achieving results close to
human expert performance. These results suggested that our system is
viable for complex event extraction and, ultimately, understanding complex
systems and mechanisms.

1.  Introduction 

Complex mechanisms consist of many highly interconnected components, yet
they are often described in disconnected fragmentary ways. Examples
include ecosystems, social dynamics and signaling networks in biology. The
study of these complex systems is often focused on a small portion of a
mechanism at a time. In addition, the huge volume of scientific literature
makes it difficult to track the fast developments in the field to achieve
a comprehensive understanding of the often distant and convoluted
interactions in the system.

The goal of the DRUM (Deep Reader for Understanding Mechanisms) project is
to develop a system that can read papers and combine research results of
individual studies into a comprehensive explanatory model of a complex
mechanism. The system will automatically read scientific papers, extract
relevant new model fragments, and compose them into larger models that
will expose the interactions and relationships between disparate elements
in the mechanism.

A first step towards this goal is to automatically extract relevant events
and event relationships from the literature. In this paper we will
describe initial steps in extending an existing general deep language
understanding system, TRIPS (Allen et al., 2008), to the genre of
scientific writing, in particular in the biomedical domain. Events in
biomedical research papers are described in a highly specialized and
technical language, with complex formulations and nested constructions. We
will discuss adaptations made and how the design principles of TRIPS
facilitate such adaptations.

We will report on an experimental evaluation of this extended system on
extracting events and relationships centered on the Ras signaling pathways
from a number of text passages in scientific papers. Our system was the
best performing system among those evaluated, achieving results close to
human expert performance.

Admittedly this was a small and preliminary evaluation. However, the
results suggested our system is viable for complex event extraction. Of
note, unlike typical statistical approaches, we did not train on text
describing the Ras signaling pathways (or on any other text for that
matter). Our results were achieved using a general deep language
understanding system, with little domain-specific customization beyond the
recognition of named entities and some specialized vocabularies. Most
important, our goal does not stop at the surface extraction of events, as
is the case for many existing bio-event extraction tasks. With a general
deep language understanding system, we are in a good position to develop
an understanding of the underlying connections in complex models, and the
methods developed to achieve that understanding will be readily
transferable to domains other than biology.

2.  The  TRIPS  Architecture 

Much recent text processing work has focused on developing “shallow”,
statistically driven techniques. TRIPS takes a different approach, using
statistical methods as a preprocessing step to provide guidance to a deep
parsing system that uses a detailed, hand-built grammar of English with a
rich set of semantic restrictions. Figure 1 shows an overview of the
system architecture. In the rest of this section we will describe briefly
the main components of the system. The content extractor, customized for
biomedical text, will be discussed in more detail in Section 4.

Figure 1. System Architecture.

Figure 2. The logical form produced by DRUM of the sentence “ASPP2 can be
phosphorylated at serine 827 by MAPK1.”

2.1.  Parser 

The TRIPS grammar is a lexicalized context-free grammar, augmented with
feature structures and feature unification. The grammar is motivated by
X-bar theory (Jackendoff, 1977), and draws on principles from GPSG (Gazdar
et al., 1985), for example head and foot features, and HPSG (Pollard and
Sag, 1987, 1994). The search in the parser is controlled by a set of
hand-built preferences encoded as weights on the rules, together with
domain-general selectional restrictions (encoded in the lexicon and
ontology) to eliminate semantically anomalous sense combinations.

The TRIPS parser uses a packed-forest chart representation and builds
constituents bottom-up using a best-first search strategy similar to A*,
based on rule and lexical weights and the influences of the front end
components (described below).

The parser constructs from the input a logical form, which is a semantic
representation that captures an unscoped modal logic (Allen, 1995;
Manshadi et al., 2008). The logical form includes the surface speech act,
semantic types, semantic roles for predicate arguments, and dependency
relations. Consider the sentence:

ASPP2 can be phosphorylated at serine 827 by MAPK1.

Figure 2 is a graphical depiction of the logical form of this sentence
produced by DRUM. The nodes in the graph represent the word senses and
ontology types, together with quantification information, and the edges
represent semantic role relations. Of particular interest are two of the
core semantic roles: AGENT (here, MAPK1), identifying objects that play a
causal role in an event; and AFFECTED (here, ASPP2), identifying objects
that are changed as part of an event. Other roles also provide key
information that needs to be extracted. For instance, LOCATION identifies
the molecular site (here, serine 827) or cellular location (e.g.,
cytoplasm) associated with an event of interest. The logical form also
captures tense, modality and aspect information, which is crucial for
determining, for example, whether a statement is a stated fact, a
conjecture or a possibility (as indicated by the modality).
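
To make the structure described above more concrete, the logical form of the example sentence can be pictured as a small graph of typed nodes connected by role edges. The following is only a minimal sketch in Python, not the TRIPS representation itself: the `LFNode` class, the node identifiers, the `modality` field and the type name `ONT::MOLECULAR-SITE` are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LFNode:
    """One node of a logical-form graph (hypothetical encoding)."""
    node_id: str
    ont_type: str                     # ontology type of the word sense
    word: str                         # surface word or phrase
    roles: Dict[str, str] = field(default_factory=dict)  # role -> node id
    modality: Optional[str] = None    # e.g. "can" marks a possibility


# "ASPP2 can be phosphorylated at serine 827 by MAPK1."
nodes = {
    "e1": LFNode("e1", "ONT::PHOSPHORYLATION", "phosphorylated",
                 roles={"AGENT": "t2", "AFFECTED": "t1", "LOCATION": "t3"},
                 modality="can"),
    "t1": LFNode("t1", "ONT::PROTEIN", "ASPP2"),
    "t2": LFNode("t2", "ONT::PROTEIN", "MAPK1"),
    # The site type name below is an assumed placeholder.
    "t3": LFNode("t3", "ONT::MOLECULAR-SITE", "serine 827"),
}


def role_filler(graph, event_id, role):
    """Return the surface word filling `role` of an event, or None."""
    target = graph[event_id].roles.get(role)
    return graph[target].word if target is not None else None
```

With this encoding, `role_filler(nodes, "e1", "AGENT")` yields the causal participant and `nodes["e1"].modality` records that the statement expresses a possibility rather than a stated fact.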


Figure 3. A subset of the TRIPS upper event ontology, showing core roles


2.2.  Ontology  and  Lexicon

The parser draws on a general purpose semantic lexicon and ontology which
define a range of word senses and lexical semantic relations. The core
semantic lexicon was constructed by hand and contains approximately 7,500
lemmas (generating approximately three times that many words) and 2,000
concepts in the ontology.

The ontology is organized hierarchically and each ontology concept has
associated with it possible semantic roles and selectional preferences
that further refine the concept. For instance, it can be specified that
the AFFECTED role of phosphorylate may only take a physical object that is
part of a molecule (e.g., a protein or a molecular site). Figure 3 shows a
portion of the TRIPS upper ontology for events and their associated core
semantic roles.

A TRIPS lexicon entry is composed of two key parts. The first is the
ontology type of the word sense, and it receives the roles and
restrictions specific to its ontology type together with those inherited
from its ancestors in the ontology hierarchy. The second is the
grammatical constructions that the word can participate in, in the form of
rules that map syntactic patterns to instantiations of objects from the
ontology.
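
The inheritance of roles and restrictions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TRIPS implementation: the `Concept` class, the restriction labels (`"any"`, `"molecular-part"`) and the lowercase type name `ONT::phosphorylation` are hypothetical.

```python
class Concept:
    """An ontology concept with locally declared role restrictions."""

    def __init__(self, name, parent=None, roles=None):
        self.name = name
        self.parent = parent
        # role name -> selectional restriction on the filler (assumed labels)
        self.roles = roles or {}

    def all_roles(self):
        """Merge inherited restrictions with local ones; local ones win."""
        inherited = self.parent.all_roles() if self.parent else {}
        return {**inherited, **self.roles}


# A generic event type declares the core roles without constraining them.
event_of_change = Concept("ONT::event-of-change",
                          roles={"AGENT": "any", "AFFECTED": "any"})

# A more specific type refines AFFECTED and inherits AGENT unchanged.
phosphorylation = Concept("ONT::phosphorylation",
                          parent=event_of_change,
                          roles={"AFFECTED": "molecular-part"})
```

A lexicon entry would then pair such a concept with the constructions the word can appear in; only the first of the two parts is modelled here.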

2.3.  Extending  the  Lexicon 

To attain broad lexical coverage beyond its hand-defined core lexicon,
TRIPS uses input from a variety of external resources, some of which will
be described in the next sections. Using the built-in subsystem
WordFinder, TRIPS can augment its lexicon by dynamically building lexical
entries with plausible semantic and syntactic structures for virtually any
word in WordNet (Fellbaum, 1998), thus extending its coverage to over
100,000 words.

For words not in the core lexicon, WordFinder uses a hand-built mapping
between the hypernym information in WordNet (for all the WordNet senses)
and the TRIPS ontology. For each identified TRIPS class it gathers all the
possible constructions that words of this class in the TRIPS lexicon
participate in. It then generates a set of lexical entries for the unknown
word by combining each possible ontological class with each possible
construction for that class. While this procedure may over-generate, the
key is to include the correct constructions among the generated
possibilities, since these correct constructions will be the ones realized
in parsing sentences (for more information see Allen, 2014).
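
The generation step described above amounts to pairing every candidate ontology class with every construction known for that class. The sketch below illustrates this under stated assumptions: the function name, the synset labels and the construction names are invented placeholders, and real WordNet lookups are replaced by a plain list.

```python
def generate_entries(word, hypernyms, hypernym_to_trips, constructions_by_class):
    """Build candidate lexical entries for a word missing from the core lexicon.

    hypernyms:              hypernym labels collected over all senses of the word
    hypernym_to_trips:      hand-built mapping from hypernym label to TRIPS class
    constructions_by_class: constructions used by core-lexicon words of a class
    """
    classes = []
    for hypernym in hypernyms:
        trips_class = hypernym_to_trips.get(hypernym)
        if trips_class is not None and trips_class not in classes:
            classes.append(trips_class)

    entries = []
    for trips_class in classes:
        for construction in constructions_by_class.get(trips_class, []):
            entries.append((word, trips_class, construction))
    return entries


# Toy data; none of these labels are taken from WordNet or TRIPS.
hypernym_to_trips = {
    "increase.v.x": "ONT::increase",
    "change.v.x": "ONT::event-of-change",
}
constructions_by_class = {
    "ONT::increase": ["agent-affected-transitive", "affected-intransitive"],
    "ONT::event-of-change": ["agent-affected-transitive"],
}
entries = generate_entries(
    "potentiate",
    ["increase.v.x", "change.v.x", "unmapped.v.x"],
    hypernym_to_trips,
    constructions_by_class,
)
```

The result deliberately over-generates: three entries are produced for the toy word, and the parser is left to pick whichever one fits the sentence.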

2.4.  Front  End  Components 

To support more robust processing and domain configurability, the core
system has the capability to incorporate a variety of statistical and
symbolic natural language processing components in the front end, as well
as domain-specific components such as specialized named entity
recognizers. These include several off-the-shelf natural language tools
such as the Shlomo Yona sentencizer¹, the Stanford part-of-speech tagger
(Toutanova and Manning, 2000), the Stanford named-entity recognizer (NER)
(Finkel et al., 2005) and the Stanford Parser (Klein and Manning, 2003).
The output of these and other specialized preprocessors (e.g., a street
address recognizer) is sent to the parser as advice. The parser then
decides whether to follow these pieces of advice as it searches for the
optimal parse of the sentence.

¹ http://search.cpan.org/~kimryan/Lingua-EN-Sentence-0.29/


Resource                                  Entities                                     References
----------------------------------------  -------------------------------------------  -------------------------
BRENDA Tissue Ontology                    tissues, cell types, cell lines              Gremse et al., 2011
Cell Ontology (CL)                        cell types                                   Diehl et al., 2011
Chemical Entities of Biological           chemicals, molecule types, cell components   Degtyarenko et al., 2008
  Interest (ChEBI)
Gene Ontology (GO)                        molecular functions, biological processes,   Ashburner et al., 2000
                                            pathways, cell components,
                                            macromolecular complexes
HUGO Gene Nomenclature (HGNC)             genes                                        Gray et al., 2015
Medical Subject Headings (MeSH®),         drugs and chemicals                          Lipscomb, 2000
  Supplementary Concept Records (SCR)
neXtProt                                  cell lines, protein families                 Gaudet et al., 2015
Pfam                                      protein families                             Finn et al., 2014
Proteomics Standards Initiative for       molecular interactions, molecule type,       Hermjakob et al., 2004
  Molecular Interaction (PSI-MI)            macromolecular complexes, genes and
                                            proteins, biological roles, units of
                                            measurement
UniProtKB (Swiss-Prot)                    proteins                                     UniProt Consortium, 2014

Table 4. Sources of domain-specific terminology/concepts and the types of
entities incorporated into the TRIPS ontology

3.  Extensions  and  Customization  for  the 
Biomedical  Genre 

We describe below several extensions to the general TRIPS system to better
handle the text characteristics of the biomedical literature.

3.1.  Genre  Specialization 

The chart produced by the parser is searched using a dynamic programming
algorithm to find the least cost sequence of constituents according to a
cost table that can be varied by genre. For instance, in dialogue systems
speech acts such as CONFIRM (e.g., ok) or GREET (e.g., hello) are
expected. For papers in the biomedical domain, such speech acts almost
never occur and thus are discounted in favor of TELL statements.
Similarly, in dialogue systems utterances are expected to be fairly short
and colloquial, whereas in scientific text the sentence structures are
expected to be much more formal and involved. The parameters for parsing
and the cost table are set accordingly.
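
The effect of a genre-dependent cost table on the search described above can be illustrated with a small dynamic program over chart spans. This is a sketch under stated assumptions rather than the TRIPS algorithm: the additive per-category costs, the numeric values, the `FRAGMENT` category and the function name are all hypothetical.

```python
def least_cost_sequence(n_tokens, constituents, cost_table, default_cost=1.0):
    """Find the cheapest sequence of constituents covering tokens [0, n_tokens).

    constituents: list of (start, end, category) spans with start < end
    cost_table:   genre-specific cost per category (assumed, additive)
    """
    inf = float("inf")
    best = [inf] * (n_tokens + 1)    # best[i]: cheapest cover of tokens [0, i)
    back = [None] * (n_tokens + 1)   # last constituent of that cheapest cover
    best[0] = 0.0
    for end in range(1, n_tokens + 1):
        for span in constituents:
            start, span_end, category = span
            if span_end != end or best[start] == inf:
                continue
            cost = best[start] + cost_table.get(category, default_cost)
            if cost < best[end]:
                best[end], back[end] = cost, span
    if best[n_tokens] == inf:
        return inf, []               # no sequence covers the whole input
    sequence, pos = [], n_tokens
    while pos > 0:                   # follow back-pointers to recover the path
        sequence.append(back[pos])
        pos = back[pos][0]
    return best[n_tokens], sequence[::-1]


# The same toy chart searched under two hypothetical genre cost tables.
chart = [(0, 1, "CONFIRM"), (1, 4, "TELL"), (0, 4, "FRAGMENT")]
dialogue = {"CONFIRM": 0.2, "TELL": 1.0, "FRAGMENT": 2.0}
biomedical = {"CONFIRM": 5.0, "TELL": 1.0, "FRAGMENT": 2.0}

_, dialogue_seq = least_cost_sequence(4, chart, dialogue)
bio_cost, bio_seq = least_cost_sequence(4, chart, biomedical)
```

Under the dialogue table the search prefers a CONFIRM followed by a TELL, while the biomedical table makes the CONFIRM reading expensive enough that a different analysis wins. Only the cost table changes between the two runs.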

In addition, the system can choose to incorporate different front end
components. For instance, for the biomedical literature a street address
recognizer would not be very useful, but a named entity recognizer for
protein names would be most crucial.

Such customizations not only optimize the parser efficiency, but also
reduce the potential ambiguities the parser has to deal with, since each
additional component offers additional, potentially conflicting, advice
the parser has to take into account.

3.2.  Lexicon  and  Ontology  Enhancements 

The biomedical domain uses specific terminology that is outside the core
TRIPS lexicon and ontology. We extended the system’s coverage by
incorporating domain-specific terminology, with mappings to TRIPS ontology
classes. In some cases we introduced new ontology categories to
accommodate domain-specific concepts. Table 4 lists the resources used, as
well as the types of entities mapped to the TRIPS ontology. Some of these
resources organize concepts in ontologies (e.g., using the OBO format
(Smith et al., 2007)); for these, we grafted the relevant nodes onto the
TRIPS ontology (see Blaylock et al., 2011). For example, most GO
biological processes are mapped to the existing TRIPS ontology category
ONT::event-of-change; however, children of GO:0007165 (signal
transduction) are names/types of signaling pathways, and they are mapped
to ONT::signaling-pathway, a domain-specific category newly added to the
TRIPS ontology. Controlled vocabularies for single entity types (e.g.,
neXtProt’s Cellosaurus) were mapped to single TRIPS ontology types (e.g.,
ONT::cell-line).

In addition, we used the SPECIALIST lexicon (McCray et al., 1994) for
obtaining syntactic category information about domain-specific lexical
items, which is helpful during parsing; however, since SPECIALIST does not
include semantic information, the lexical entries are not mapped into the
TRIPS ontology.
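
The grafting of GO biological processes onto the TRIPS ontology described in this section can be pictured as a lookup over the term's ancestors. The sketch below is illustrative only and makes several assumptions: it treats "children" as any descendant, it considers only biological-process terms, and the term identifiers other than GO:0007165 are made-up placeholders rather than real GO terms.

```python
SIGNAL_TRANSDUCTION = "GO:0007165"


def trips_type_for_process(term_id, parents):
    """Choose a TRIPS ontology type for a GO biological-process term.

    parents: dict mapping a term id to the list of its direct parent ids
    """
    seen = set()
    stack = list(parents.get(term_id, []))
    while stack:
        ancestor = stack.pop()
        if ancestor == SIGNAL_TRANSDUCTION:
            return "ONT::signaling-pathway"
        if ancestor not in seen:
            seen.add(ancestor)
            stack.extend(parents.get(ancestor, []))
    # Default category for the remaining biological processes.
    return "ONT::event-of-change"


# Placeholder hierarchy; "GO:example-*" identifiers are not real GO terms.
parents = {
    "GO:example-pathway": ["GO:example-intermediate"],
    "GO:example-intermediate": [SIGNAL_TRANSDUCTION],
    "GO:example-process": ["GO:example-root"],
}
```

Here `trips_type_for_process("GO:example-pathway", parents)` falls under signal transduction and is assigned the new domain-specific category, while the unrelated placeholder process keeps the general one.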
(EVENT ONT::V31830 ONT::REGULATE :AGENT ONT::V31826 :AFFECTED ONT::V31848 :TENSE W::PRES
  :START 0 :END 46)

(EVENT ONT::V31826 ONT::ACTIVATE :AFFECTED ONT::V31823 :START 0 :END 15)

(EVENT ONT::V31848 ONT::PHOSPHORYLATION :AFFECTED ONT::V31845 :DRUM ((:DRUM :ID GO::|0016310|
  :NAME "phosphorylation"...)) :START 25 :END 46)

(TERM ONT::V31823 ONT::PROTEIN-FAMILY :NAME W::RAS :DRUM ((:DRUM :MEMBER-TYPE ONT::PROTEIN
  :MEMBERS (HRAS NRAS KRAS))) :START 0 :END 4)

(TERM ONT::V31845 ONT::PROTEIN :NAME W::ASPP-2 :DRUM ((:DRUM :ID UP::Q13625)) :START 25 :END 31)


Figure 5. The logical form of “RAS activation regulates ASPP2
phosphorylation.” and the events and terms extracted by DRUM.


3.3.  Specialized  Constructions 

The TRIPS component WordFinder can construct lexical entries for words not
explicitly found in the core lexicon, using a mapping between WordNet and
the TRIPS ontology. This mechanism provides broad coverage of words in
general use. However, certain “everyday” words have specialized usage in
biology. For instance, “association” is not just a vague relationship but
a specific kind of binding between molecules. Some other words are used in
idiosyncratic constructions. For instance, “the protein localizes to the
nucleus”, which means the protein exists in the nucleus, required a novel
syntactic template (and semantic characterization). These words pose
particular difficulties for our system as our automatically derived
general constructions would be inadequate. For such cases we often have to
provide hand-tailored lexical entries with appropriate syntactic templates
and semantic restrictions to distinguish the everyday and biological
senses of the words.

3.4.  Handling  Nominalizations 

Nominalization is prevalent in the biomedical genre (see for instance the
example sentence in Figure 5). The TRIPS parser has a general mechanism
for handling verb nominalizations. This is enabled by the fact that the
ontological information is identical between the verbal and nominal forms
of the same event (e.g., dominate and dominance). The only difference
between verbal and nominal forms is the grammatical linking rules
involved. For instance, for verbal forms the subject of a certain verb
might map to the AGENT role, and the direct object to the AFFECTED role.
In nominalizations, the possessive would map to the role identified with
the subject of the verbal form, and the object of an of prepositional
phrase would map to the role identified with the direct object of the
verb. While there are a number of different constructions used with
nominals, they appear to be generic across the entire set of
nominalizations, and a set of a dozen or so generic rules is all that is
needed. In addition, virtually all adjunct modifications (e.g., for three
hours) apply equally well to both verbal and nominal forms using the same
adverbial modification rules in the grammar.
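
The relationship between verbal and nominal linking rules described above can be sketched by routing nominal slots through the corresponding verbal slots. The slot names, the two tables and the function below are assumptions made for illustration; they cover only the possessive and of-phrase cases mentioned in the text, not the full set of generic rules.

```python
# Linking rules of a hypothetical transitive verb: syntactic slot -> role.
VERB_LINKS = {"subject": "AGENT", "direct_object": "AFFECTED"}

# Generic nominalization rules: nominal slot -> equivalent verbal slot.
NOMINAL_TO_VERBAL = {"possessive": "subject", "of_pp": "direct_object"}


def assign_roles(form, arguments, verb_links=VERB_LINKS):
    """Map syntactic arguments to semantic roles for a verbal or nominal form."""
    roles = {}
    for slot, filler in arguments.items():
        if form == "nominal":
            slot = NOMINAL_TO_VERBAL.get(slot, slot)
        role = verb_links.get(slot)
        if role is not None:
            roles[role] = filler
    return roles


# "RAS activates RAF" versus "RAS's activation of RAF"
verbal = assign_roles("verbal", {"subject": "RAS", "direct_object": "RAF"})
nominal = assign_roles("nominal", {"possessive": "RAS", "of_pp": "RAF"})
```

Because the nominal rules are stated once in terms of verbal slots, both forms end up with the same role assignment without any per-verb nominal entries.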

4.  Event  Extraction 

From the logical forms produced by the extended TRIPS parser we need to
extract the events and event relationships of interest. Because much of
the variation expected in sentence constructions is handled by the
extended TRIPS system, we are able to use a relatively compact
specification for defining the events and relationships of interest, while
coping with fairly complex and nested formulations.


rule-activate (40): ACTIVATE(AGENT, AFFECTED) ← [ONT::start ONT::start-object] (AGENT, AFFECTED)
rule-decrease (20): DECREASE(AGENT, AFFECTED) ← [ONT::decrease] (AGENT, AFFECTED)

Figure 6. Specification of the extraction rules for two event types

4.1.  An  Example 

Consider the sentence:

RAS activation regulates ASPP2 phosphorylation.

whose logical form is depicted in Figure 5. There are three events in this
sentence: the central regulation event and two nested events, activation
and phosphorylation, that serve as the arguments to the regulation event.
The extractions of the three events are also shown in Figure 5, together
with the two terms, RAS and ASPP2, involved in the events. Note that the
word “activation” is mapped to the TRIPS ontology type ONT::start. It is
this ontology type that triggers the extraction rule for an ACTIVATE event
(see Figure 6).

In addition to the AGENT and AFFECTED roles, the :DRUM slot provides
DRUM-specific grounding information about the events and entities, mostly
derived from bio-resources (see Section 3.2). For example, UP::Q13625 is
the UniProt identifier for the protein ASPP2.

4.2.  Extraction  Rule  Specification 

We capitalized on the TRIPS ontology and parser to develop a compact and
easy-to-maintain specification of event extraction rules. Instead of
having to write one rule to match each keyword/phrase that could signify
an event, many of these words/phrases have already been systematically
mapped to a few types in the TRIPS ontology, using a combination of the
TRIPS internal lexicon and the WordFinder component which allows us to
attain the coverage of WordNet. For instance, accumulate, gain, amplify,
multiply, boost, double, among others, all map to the TRIPS ontology type
ONT::increase.

In addition, the parser handles various surface structures, and the
logical form returned contains normalized semantic roles. For example,

RAS activates RAF
RAF is activated by RAS
The activation of RAF by RAS
Activated RAF
RAF activation

all are parsed into the same basic logical form with the semantic roles
AFFECTED: RAF and, where applicable, AGENT: RAS. Thus, we needed very few
(often only one) extraction rule specifications for each event type,
covering a wide range of words and syntactic patterns.

Finally, since most events of interest are events of action, the usage
patterns of these event words are often essentially identical, modulo the
ontology types that signify the events and (less often) the semantic roles
that correspond to the event arguments. We generated these rules using a
module with standardized rule components, parameterized by only the
event-specific ontology types and semantic role mappings. For example, X
activates / decreases / regulates / phosphorylates Y, though denoting
different events, all exhibit the same basic structure with the main
semantic roles AGENT and AFFECTED. Complements denoting for example
molecular sites and cellular locations for the most part retain the same
structure across event types.

Figure 6 shows the stylized specification of two event types, ACTIVATE and
DECREASE. The ACTIVATE line is read as follows:

name of rule: activate
priority of rule: 40
name of event to be extracted: ACTIVATE
semantic role 1: AGENT
semantic role 2: AFFECTED
ontology types: ONT::start; ONT::start-object

where the rule priority determines which rule is selected when multiple
rules apply, and the ontology types are those in TRIPS that map to the
target event type (here, ACTIVATE). The semantic roles may have further
constraints on the types that can fill these roles. For instance, only
molecular and cellular participants (e.g., proteins, chemicals, nucleus)
are of interest in the context of biological events.

Note the similarity between the information for ACTIVATE and DECREASE. The
only difference between the two lines is the ontology types that represent
the respective event types (ONT::start, ONT::start-object for the former
and ONT::decrease for the latter). This compact representation makes it
easy to specify, maintain and update the extraction rules.
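
The parameterized rule specification discussed in this section can be sketched as a small table of rules plus one generic matcher. This is a minimal illustration, not the DRUM rule engine: the `Rule` class, the dictionary shape of the logical-form node, and the convention that a larger priority number wins are assumptions (the text states only that priority decides between competing rules).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int
    event: str
    ont_types: Tuple[str, ...]
    roles: Tuple[str, ...] = ("AGENT", "AFFECTED")


# The two rules of Figure 6 differ only in their event-specific parameters.
RULES = [
    Rule("rule-activate", 40, "ACTIVATE", ("ONT::start", "ONT::start-object")),
    Rule("rule-decrease", 20, "DECREASE", ("ONT::decrease",)),
]


def extract(lf_node, rules=RULES):
    """Apply the best matching rule to one logical-form node, if any.

    lf_node: {"type": <ontology type>, "roles": {<role>: <filler>}} (assumed)
    """
    matching = [rule for rule in rules if lf_node["type"] in rule.ont_types]
    if not matching:
        return None
    rule = max(matching, key=lambda r: r.priority)  # assumed: higher wins
    event = {"event": rule.event}
    for role in rule.roles:
        if role in lf_node["roles"]:
            event[role] = lf_node["roles"][role]
    return event


# "RAS activation": the nominal is typed ONT::start with RAS as AFFECTED.
activation = extract({"type": "ONT::start", "roles": {"AFFECTED": "RAS"}})
```

Adding a new event type in this sketch means adding one `Rule` line, which mirrors the maintainability argument made above.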

These rules were developed from general principles rather than based on
specific training samples on the Ras signaling pathways. They were
subsequently augmented as we learned more about specific biological
usages. Although we do base our rules on the biological literature, we
emphasize that neither the extraction rules described above nor any of the
domain-specific enhancements to our system discussed in Section 3 are
specific to the language or mechanisms describing the Ras signaling
pathways. Thus, we expect our system to have comparable performance on any
input describing bio-molecular mechanisms.

Figure 7. Overview of the end-to-end extraction and reasoning system.


We and others have recently shown that ASPP2 can potentiate RAS signaling
by binding directly via the ASPP2 N-terminus [2,6]. Moreover, the RAS-ASPP
interaction enhances the transcription function of p53 in cancer cells
[2]. Until now, it has been unclear how RAS could affect ASPP2 to enhance
p53 function. We show here that ASPP2 is phosphorylated by the
RAS/Raf/MAPK pathway and that this phosphorylation leads to its increased
translocation to the cytosol/nucleus and increased binding to p53,
providing an explanation of how RAS can activate p53 pro-apoptotic
functions (Figure 5). Additionally, RAS/Raf/MAPK pathway activation
stabilizes ASPP2 protein, although the underlying mechanism remains to be
investigated.

Figure 8. Example text passage for evaluation.


5.  System  Evaluation

We participated in a preliminary evaluation of event extraction, in the
context of “reading with a model”. A biological model was given in BioPAX
(Demir et al., 2010), BEL (Selventa, 2011), and English. Given a set of
text passages from scientific papers on the Ras signaling pathways, the
goal was to extract from these passages events (and their arguments) that
were relevant to the given model and make explicit the links between the
extracted events and the model.

BioPAX and BEL do not have the linguistically motivated features and
expressivity needed for our approach. To minimize hand coding and to
create a uniform system, we created our initial model by reading and
processing sentences simplified from the given English model sentences,
using the same process as for reading and extracting information from the
test passages. The model entities and events so processed were then
compared to the entities and events extracted from the text passages.
Figure 7 shows an overview of the automated end-to-end extraction and
reasoning system.

Two types of events were distinguished here: mechanistic (e.g., X binds to
Y) and regulatory/causal relationship (e.g., X increases Y). These were
further classified with respect to the given model as: 1) new mechanism
and 2) new relationship not in the model; 3) specialization and 4)
corroboration of information in the model; and 5) conflict with the model.
In addition, each result was to be accompanied by the supporting source
text.

The reasoner aligned the extracted entities using their standardized
identifiers (e.g., UniProt, HUGO, Gene Ontology). In addition, we derived
the relationships between the model and text extractions based on the
hierarchical organization of the event types. For instance, a regulation
event subsumes a stimulation event, and thus “X regulates Y” corroborates
“X stimulates Y” and the latter is a specialization of the former.

6.  Results

Several passages, mainly from the results and discussion sections of two
scientific papers, were selected as evaluation inputs. An example passage,
from (Godin-Heymann et al., 2013), is shown in Figure 8.

The extractions and model comparisons were manually scored by a third
party, based on the combined answers provided by two separate teams of
biologists (30 events) and the addition of 5 events adopted from system
submissions (see below). In “lenient” scoring for precision, incomplete
results and results that were correct but irrelevant were excluded,
whereas in “strict” scoring these results were counted as incorrect.

Eleven systems of varying degrees of automation participated in the
evaluation. We have available only the lenient scores of other teams, as
shown in Figure 9. For lenient scoring our system was the best performing
system and our performance was close to human performance.

Note that although the human biologists had high precision, there was
considerable non-overlap between the answers they provided. This accounted
for the approximately 0.50 recall for either of the human teams, using the
pooled answers of the two teams as the gold standard.


Figure 9. Evaluation results for eleven teams. The diamond ♦ represents
the results of our system. The two topmost points are the manual scores of
the two teams of human biologists.


Our precision, recall and F1 results for both the lenient and strict
scorings are as follows:

                    P             R      F1
lenient (strict)    1.00 (0.67)   0.40   0.57
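
As a consistency check on the lenient figures reported above, the F1 score can be recomputed from precision and recall with the standard harmonic-mean formula. The helper name below is an assumption; the input values are the ones in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Lenient scoring: P = 1.00, R = 0.40
lenient_f1 = round(f1_score(1.00, 0.40), 2)
```

The recomputed value agrees with the reported lenient F1 of 0.57.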

7.  Analysis 

We believe precision is much more important than recall. A high precision
system can generate valuable knowledge nuggets, even if it does not have
high throughput, whereas output from a system with high recall but low
precision cannot be trusted to be accurate. This is especially the case
for such information-rich domains. Because of the huge volume of
scientific literature, information is likely to be duplicated in multiple
papers, and often also repeated in different forms in the same paper.
Therefore, extracting (accurately) even a relatively small portion of the
information in these papers could amount to a fair body of knowledge, even
if we cannot extract everything from every sentence.

Our system showed promising performance on the evaluation data set. We
achieved perfect precision, and recall close to the human experts. The
modest recall even for the human experts indicated that this is a fairly
difficult domain and there is not a clear-cut way to extract and encode
the knowledge represented in these papers. In fact, after considering the
submitted results, several additional events extracted by the systems but
not by the human experts were incorporated into the gold standard.

We  were  able  to  extract  some  fairly  complex, 
nested,  events,  similar  to  the  one  depicted  in 
Figure  5.  The  ontology-based  extraction  and  the 
lexical  coverage  extended  by  WordFinder  al¬ 
lowed  us  to  cope  with  a  variety  of  expressions. 
For  instance,  from  “...  ASPP2  can  potentiate  RAS 
signaling...”  we  were  able  to  map  “potentiate”  to 
an  INCREASE  event  even  though  “potentiate”  is 
not  in  the  TRIPS  core  lexicon. 

Another  interesting  example  is  “...  monoubiquitination 
abrogates  GAP-mediated  GTP  hydrolysis”.  This  fairly 
complex  sentence  illustrates  some  of  the  strengths  and 
weaknesses  of  our  system.  The  system  was  able  to  extract 
two  interleaving  events: 

ev1:  REGULATE(AGENT:  GAP;  AFFECTED:  ev2) 
ev2:  HYDROLYSIS(AFFECTED:  GTP) 

In  the  raw  processing  we  also  had  the  following: 

ev3:  INHIBIT(AGENT:  MONOUBIQUITINATION;  AFFECTED:  ev2) 

but  we  failed  to  identify  what  was  being 
monoubiquitinated  and  thus  were  not  able  to  include 
this  extraction  in  our  results.  The  answer,  that 
Ras  was  being  monoubiquitinated,  could  only  be 
identified  with  more  sophisticated  discourse 
processing. 
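
The nesting in these extractions, where one event serves as an argument of another, can be sketched as a small data structure. This is an illustrative representation only, not the format used by TRIPS or by the evaluation; the class name `Event` and its `render` method are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Event:
    """An extracted event whose arguments are entity names or other events."""
    id: str
    type: str
    args: Dict[str, Union[str, "Event"]] = field(default_factory=dict)

    def render(self) -> str:
        """Format the event in the ROLE: value notation used above."""
        parts = []
        for role, value in self.args.items():
            # Nested events are shown by reference (their id), not inline.
            ref = value.id if isinstance(value, Event) else value
            parts.append(f"{role}: {ref}")
        return f"{self.id}: {self.type}({'; '.join(parts)})"


ev2 = Event("ev2", "HYDROLYSIS", {"AFFECTED": "GTP"})
ev1 = Event("ev1", "REGULATE", {"AGENT": "GAP", "AFFECTED": ev2})
ev3 = Event("ev3", "INHIBIT", {"AGENT": "MONOUBIQUITINATION", "AFFECTED": ev2})

for ev in (ev1, ev2, ev3):
    print(ev.render())
```

Here ev1 and ev3 both point at the same ev2 object, which mirrors the way the hydrolysis event is shared by the regulation and inhibition events.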

We  identified  several  main  reasons  for  omis¬ 
sions  in  our  extractions:  1)  fragmented  parses 
due  to  the  long  and  complex  sentence  structures 
common  in  scientific  publications;  2)  insufficient 
domain-specific  background  knowledge,  includ¬ 
ing  language  patterns  specific  to  biology;  3)  need 
for  improved  discourse  processing  and  corefer¬ 
ence  resolution;  and  4)  lack  of  inference  capa¬ 
bilities  and  persistent  memory  of  inferences 
made. 

The  last  point  can  be  illustrated  by  the  sentence 
“...  the  RAS-ASPP  interaction  enhances  the 
transcription  function  of  p53...”  Here  we  need  to 
be  able  to  deduce  that  the  RAS-ASPP  interaction 
produces  a  complex  of  the  two,  which  then  participates 
in  further  reactions. 

As  a  final  example,  to  be  able  to  make  sense 
of  the  seemingly  simple  sentence  “We  obtained 
similar  results  using  K-Ras...”  we  need  to  address 
all  of  the  above  issues.  Due  to  space  limitations, 
we  do  not  discuss  here  the  ongoing  work 
towards  tackling  these  challenges. 


8.  Related  Work  and  Discussion 

With  the  advent  of  relatively  successful  text  min¬ 
ing  strategies  (named  entity  recognition,  infor¬ 
mation  extraction  and  retrieval)  for  the  recogni¬ 
tion  and  normalization  of  biologically  relevant 
entities,  automatic  extraction  of  more  complex, 
relational  information  from  the  biomedical  litera¬ 
ture  has  become  a  very  active  area  of  research. 
Shared  Tasks  (STs)  such  as  the  Protein-Protein 
Interaction  (PPI)  Task  introduced  at  BioCreative 
II  (Krallinger  et  al.,  2008)  and  the  BioNLP  GENIA 
Event  Extraction  Task  (Kim  et  al.,  2009; 
Kim  et  al.,  2011;  Kim  et  al.,  2013)  have  spurred 
a  lot  of  activity  in  this  area,  although  examples  of 
earlier  work  certainly  exist. 

The  goal  in  the  PPI  task  is  to  extract  binary 
protein-protein  interaction  pairs  from  full-text 
articles.  More  general  biological  events  (e.g., 
regulatory  events)  beyond  PPI  involve  much 
more  varied  relationships  between  entities  and, 
indeed,  between  events  themselves,  leading  to 
complex  nested  structures.  The  BioNLP  STs 
have  evolved  to  include  more  complex  types  of 
events  and  arguments.  The  GENIA  ST  (in  particular 
the  2013  edition,  which  included  coreference)  and  the 
Epigenetics  and  Post-translational  Modifications 
task  (EPI)  introduced  in  2011  (Ohta  et  al.,  2011) 
are  similar  to  our  task.  However,  there  are  significant 
differences,  too.  We  were  not  provided 
with  gold  annotations  for  entities;  all  relevant 
entities  (including  drugs,  cell  lines,  cell  compo¬ 
nents,  sites)  had  to  be  extracted,  and  most  of 
them  had  to  be  grounded  in  a  reference  database. 
Protein  families  were  also  important,  as  was  the 
relation  between  families  and  the  member  pro¬ 
teins.  Not  only  were  coreferences  supposed  to  be 
resolved,  but,  as  indicated  in  Section  5,  some¬ 
times  complex  inferences  were  required  to  obtain 
a  target  event.  In  summary,  our  task  was  not  de¬ 
signed  to  accommodate  specific  Information  Ex¬ 
traction  (IE)  techniques;  rather,  in  our  evaluation 
the  gold  standard  was  human  performance. 

We  would  like  to  stress  that  our  goal  goes  beyond 
IE.  The  need  for  deeper  semantic  approaches 
has  been  recognized  before  (see,  e.g., 
Ananiadou  et  al.,  2010).  Still,  the  field  is  dominated 
by  ML  classifiers  (for  a  list  of  the  top-performing 
systems  in  the  three  BioNLP  STs 
held  so  far,  see  Ananiadou  et  al.,  2014).  This 
sometimes  leads  to  seemingly  paradoxical  results, 
where  systems  can  extract  phosphorylation  events 
with  relatively  good  performance,  but 
not  ubiquitination  events,  because  the  training 
data  did  not  contain  enough  examples  of  the  latter 
(Kim  et  al.,  2011). 

Indeed,  ontological  information  is  rarely  used 
in  current  systems.  GenIE  (Cimiano  et  al.,  2005) 
is  an  early  example  of  an  ambitious  ontology-driven 
system  that  attempts  to  identify  events 
based  on  constructing  a  full  semantic  representa¬ 
tion  of  the  text  (using  a  semantic  lexicon  and 
semantic  restrictions),  as  well  as  relations  be¬ 
tween  events  (using  discourse  information).  The 
ontology  they  used,  however,  was  a  small, 
domain-specific  one.  To  our  knowledge  the  sys¬ 
tem  has  not  been  tested  on  any  of  the  more  recent 
event  extraction  tasks. 

Although  semantic  (deep)  parsing  techniques 
have  rarely  been  used  for  bio-event  extraction, 
we  note  the  PPI  extraction  study  by  Miyao  et  al. 
(2009),  who  found  an  HPSG-based  parser  to 
outperform  (particularly  in  terms  of  precision) 
dependency  and  syntactic  parsers,  especially 
when  trained  on  domain-specific  corpora.  How¬ 
ever,  they  used  the  predicate-argument  structures 
output  by  the  parser  as  additional  features  for  a 
statistical  classifier. 

In  contrast,  we  do  not  depend  on  training  with 
a  domain-specific  corpus  (although  we  have  the 
capability  to  incorporate  modules  that  do);  rather, 
we  extract  events  directly  from  the  predicate- 
argument  structures  represented  in  the  logical 
form,  based  on  linguistic  first  principles  that  can 
be  easily  adapted  to  different  domains.  The  ad¬ 
vantage  of  this  approach  can  be  readily  seen  in 
this  evaluation,  in  which,  with  a  relatively  short 
(but  intensive)  ramp  up,  we  were  able  to  outper¬ 
form  all  other  systems  in  the  extraction  of  com¬ 
plex  events  and  event  relations.  Of  note,  this  was 
despite  the  fact  that  our  system  had  lower  named 
entity  recognition  scores  than  most  others,  par¬ 
ticularly  those  with  a  history  of  participation  in 
biomedical  information  extraction  shared  tasks. 

The  purpose  of  this  evaluation  was  not  a  rig¬ 
orous  ranking  of  the  different  participating  sys¬ 
tems.  Rather,  we  learned  key  areas  we  needed  to 
improve.  The  results  of  this  evaluation  suggested 
that  our  system  is  viable  for  complex  event  extraction. 
This  is,  however,  only  the  first  step  in 
understanding  complex  models  and  mechanisms. 
A  general  deep  language  understanding  system 
that  can  be  extended  with  domain-specific  in¬ 
formation  will  allow  us  to  go  beyond  standard 
surface  extraction  tasks  and  develop  the  capabili¬ 
ties  to  truly  understand  big  and  complex  mecha¬ 
nisms. 